outline

  1. structure
  2. Data: dataset, preprocessing, visualisation,
  3. priminarly examination: paired correlation and spatial correlation, piared R squared, scatterplot
  4. machine learning methods: model parameter tunning, interpretation
This tutorial shows from data exploration to the modelling process for the global air pollution modelling project. The statistical learning methods used include Lasso, random forest, stochastic gradient boosting, extreme gradient boosting. The partial dependence plot and variable importance are visualised to partly interpretate models.

Required packages

ipak <- function(pkg){
 
   new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
   if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
#repos='http://cran.muenster.r-project.org'

stdata = c("sp", "sf", "raster")
Stat_methods = c("glmnet", "ranger", "gbm", "xgboost", "party", "caret", "gstat")
visual = c("RColorBrewer", "ggplot2", "corrplot", "tmap", "leaflet", "mapview","leafem", "pdp", "vip", "DT", "sparkline")
map = c("maptools")
tidy = c("devtools", "dplyr",  "tidyr", "reshape2", "knitr")
other = c("countrycode", "htmlwidgets", "data.table", "Matrix", "GGally")
 

packages <- c(stdata, tidy, Stat_methods, visual, map, other)
ipak(packages)
##           sp           sf       raster     devtools        dplyr        tidyr 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##     reshape2        knitr       glmnet       ranger          gbm      xgboost 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##        party        caret        gstat RColorBrewer      ggplot2     corrplot 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##         tmap      leaflet      mapview       leafem          pdp          vip 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##           DT    sparkline     maptools  countrycode  htmlwidgets   data.table 
##         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 
##       Matrix       GGally 
##         TRUE         TRUE

Auxilary package with preprocessed data in dataframe.

install_github("mengluchu/APMtools")
library(APMtools)
ls("package:APMtools")
##  [1] "Brt_imp"              "Brt_LUR"              "Brt_pre"             
##  [4] "cforest_LUR"          "countrywithppm"       "create_ring"         
##  [7] "ctree_LUR"            "DENL_new"             "error_matrix"        
## [10] "join_by_id"           "Lasso"                "Lasso_pre"           
## [13] "Lassoselected"        "mechanical"           "merge_roads"         
## [16] "merged"               "merged_nightlight"    "mergeraster2file"    
## [19] "plot_error"           "plot_rsq"             "ppm2ug"              
## [22] "predicLA_RF_XGBtiles" "RDring_coef"          "removedips"          
## [25] "retrieve_predictor"   "rf_imp"               "rf_LUR"              
## [28] "rf_pre"               "sampledf"             "scatterplot"         
## [31] "subset_grep"          "univar_rsq"           "xgb_pre"             
## [34] "xgboost_imp"          "xgboost_LUR"

If the above is not successeful for MacOS users, the following and a restart of R may be needed

gd = fread("~/Documents/GitHub/Global mapping/2020_06_world/stations_20200602.csv")
avg = fread("~/Documents/GitHub/Global mapping/oaqEUAUCAUS.csv")
gdata = merge(gd, avg, by.x = c("long", "lat"), by.y = c("LONGITUDE","LATITUDE" ))
g1 = na_if(gdata, -9999.9)
g2 = g1%>%dplyr::select(-id, -dir, -V1, -Rsp)%>%filter(value_mean >0)  

For the old Mac machines

system('defaults write org.R-project.R force.LANG en_US.UTF-8')

Predictor variables:

vastring = "road|nightlight|population|temp|wind|trop|indu|elev"

Dataset:

  • value_mean: annual mean \(NO_2\) (\(mg/m^3\)). ( for the dataset merged.rda:
  • day_value: annual mean \(NO_2\) (\(mg/m^3\)) at daytime.
  • night_value: annual mean \(NO_2\) (\(mg/m^3\)) at night time.
    )

1. road_class_XX_size: road lenght within a buffer with radius “size” of type XX. ROAD_1: highway, ROAD_2: primary, ROAD_3: secondary, ROAD_4: tertiary, ROAD_5: unpaved
2. industry_size: Industrial area within a buffer with radius “size”.
3. trop_mean: TROPOMI averaged over Feb 2018 - Jan 2019.
4. temperature_2m_m: monthly mean temperature at 2m height of month “m”.
5. wind_speed_10m_m:monthly mean wind speed at 2m height of month “m”.
6. poppulation_1000/ 3000 /5000: population 1, 3, 5 km resolution.
7. Rsp: Surface remote sensing and chemical transport model product (only for 2012).
8. OMI_mean_filt: OMI column density, 2017 annual average.
9. nightlight_size: nightlight VIIRS data in original resolution (500 m) and various buffer sizes.

NO2 annual concentration, all the units are converted to ug/m3. In the data display later you can see countries with different units.

g2 %>% dplyr::select(value_mean ) %>% summary()
##    value_mean      
##  Min.   : 0.00022  
##  1st Qu.:13.85912  
##  Median :22.48644  
##  Mean   :24.37987  
##  3rd Qu.:32.65573  
##  Max.   :98.34452
#datatable(g2, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = T))

Merge roads of different road types, here 3, 4, 5, the road length of these road types are aggregated. The original road types are substituted (with keep = T, they are remained).

merged_mr = merge_roads(g2, c(3, 4, 5), keep = F) # keep = T keeps the original roads. 
names(merged_mr)
##  [1] "road_class_M345_1000" "road_class_M345_100"  "road_class_M345_25"  
##  [4] "road_class_M345_3000" "road_class_M345_300"  "road_class_M345_5000"
##  [7] "road_class_M345_500"  "road_class_M345_50"   "road_class_M345_800" 
## [10] "long"                 "lat"                  "nightlight_800"      
## [13] "nightlight_500"       "nightlight_5000"      "nightlight_3000"     
## [16] "nightlight_1000"      "elevation"            "industry_1000"       
## [19] "industry_100"         "industry_25"          "industry_3000"       
## [22] "industry_300"         "industry_5000"        "industry_500"        
## [25] "industry_50"          "industry_800"         "OMI_mean_filt"       
## [28] "population_1000"      "population_3000"      "population_5000"     
## [31] "road_class_1_1000"    "road_class_1_100"     "road_class_1_25"     
## [34] "road_class_1_3000"    "road_class_1_300"     "road_class_1_5000"   
## [37] "road_class_1_500"     "road_class_1_50"      "road_class_1_800"    
## [40] "road_class_2_1000"    "road_class_2_100"     "road_class_2_25"     
## [43] "road_class_2_3000"    "road_class_2_300"     "road_class_2_5000"   
## [46] "road_class_2_500"     "road_class_2_50"      "road_class_2_800"    
## [49] "temperature_2m_10"    "temperature_2m_11"    "temperature_2m_12"   
## [52] "temperature_2m_1"     "temperature_2m_2"     "temperature_2m_3"    
## [55] "temperature_2m_4"     "temperature_2m_5"     "temperature_2m_6"    
## [58] "temperature_2m_7"     "temperature_2m_8"     "temperature_2m_9"    
## [61] "trop_mean_filt"       "wind_speed_10m_10"    "wind_speed_10m_11"   
## [64] "wind_speed_10m_12"    "wind_speed_10m_1"     "wind_speed_10m_2"    
## [67] "wind_speed_10m_3"     "wind_speed_10m_4"     "wind_speed_10m_5"    
## [70] "wind_speed_10m_6"     "wind_speed_10m_7"     "wind_speed_10m_8"    
## [73] "wind_speed_10m_9"     "value_mean"           "country"
#numeric country
#inde_var$country=as.numeric(inde_var$country)

Visualization

Visualize with tmap: convenient

locations_sf = st_as_sf(merged_mr, coords = c("long","lat"))
osm_valuemean = tm_shape(locations_sf) +
  tm_dots( "value_mean", col = "value_mean", size = 0.05,title = "NO2 value",
     popup.vars = c("value_mean" )) + tm_view(basemaps = c('OpenStreetMap'))
#+tm_shape(lnd)+tm_lines()
tmap_save(osm_valuemean, "NO2mean.html")

Visualize with leaflet: more control. show Day/night ratio, red: day/night >1, blue, day/nigh <1

merged_fp = merged_mr %>% mutate(ratiodn = day_value/night_value) %>% mutate(color = ifelse(ratiodn > 1, "red", "blue"))
m  = leaflet(merged_fp) %>%
     addTiles() %>% addCircleMarkers(radius = ~value_mean/5, color = ~color, popup =   ~as.character(value_mean),fill = FALSE) %>% addProviderTiles(providers$Esri.NatGeoWorldMap) %>% addMouseCoordinates() %>%  
addHomeButton(ext = extent(116.2, 117, 39.7, 40), layer.name = "Beijing") %>% addHomeButton(ext = extent(5, 5.2, 52, 52.2), layer.name = "Utrecht")
saveWidget(m, file = "NO2daynight.html")

Boxplot

countryname = paste(merged_mr$country, countrycode(merged_mr$country, 'iso2c', 'country.name'), sep = ":") 

#tag country with ppm 

# use the median for colour
mergedmedian = merged_mr %>% group_by(country) %>% mutate(median =  median(value_mean, na.rm = TRUE)) 
nrow(merged_mr)
## [1] 5521
merged_mr = merged_mr %>% group_by(country) %>% mutate(count1 = n())
 merged_mr$count1
##    [1]  456  456  456  190  190  190  190  190  190  190  190  456  456  190
##   [15]  190  190  190  190  190  190  190  190  190  190  190  190  190  190
##   [29]  190  190  190  190  190  456  190  456  190  190  190  456  190  190
##   [43]  456  456  190  456  456  456  190  456  190  190  456  456  456  456
##   [57]  456  456  456  456  190  456  190  456  456  456  456  456  456  190
##   [71]  456  456  456  456  456  456  190  456  456  456  456  456  456  456
##   [85]  456  190  456  456  456  456  456  456  456  456  190  456  456  456
##   [99]  456  456  456  456  456  456  456  456  456  456  456  456  456  190
##  [113]  456  190  456  456  190  456  456  456  456  190  456  456  456  456
##  [127]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [141]  456  190  456  456  190  456  456  456  456  456  456  456  456  456
##  [155]  456  190  456  456  456  456  456  456  456  456  456  456  190  456
##  [169]  190  456  456  456  456  456  190  456  456  456  190  190  190  190
##  [183]  190  190  190  190  190  190  190  190  190  190  190  190  190  190
##  [197]  190  190  456  190  190  190  190  190  190  190  190  190  456  456
##  [211]  456  456  456  456  456  456  456  190  456  456  456  456  190  190
##  [225]  190  190  190  190  190  456  456  190  190  456  456  456  190  190
##  [239]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [253]  456  456  456  190  456  456  456  456  456  456  456  190  456  456
##  [267]  456  456  456  456  456  456  456  190  456  456  456  456  456  456
##  [281]  456  456  456  456  456  456  456  190  456  456  456  456  456  456
##  [295]  456  456  456  456  456  456  456  456  456  190  456  456  456  456
##  [309]  456  190  456  456  456  456  456  456  456  456  456    6    6    6
##  [323]    6    6    6  456  456  456  456  456  456  456  456  456  456  456
##  [337]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [351]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [365]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [379]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [393]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [407]  456  456  456  456  456  190  456  456  456  456  456  456  456  456
##  [421]  456  456  456  456  456  456  456  456  456  456  456  456  456  456
##  [435]  456  456  456  456  456  456  190  456  456  456  456  456  456  456
##  [449]  456  190  456  190  456  456  456  456  456  456  456  456  190  456
##  [463]  190  190  190  456  456  456  190  456  456  456  190  190  190  456
##  [477]  456  456  456  190  456  190  456  190  190  456  456  456  456  456
##  [491]  456  456  456  190  456  190  456  456  456  190  190  190  190  190
##  [505]  456  190  190  190  456  190  190  190  190  190  190  190  190  456
##  [519]  190  190  456  456  190  456  456  456  456  456  190  456  456  456
##  [533]  456  456  456  190  456  456  456  456  456  456  456  456  456  456
##  [547]  456  456  456  456  190  456  456  456  456  190  190  190  456  456
##  [561]  456  456  456  456  456  456  456  456  190  456  456  456  190  456
##  [575]  456  456  456  456  190  456  456  456  456  190  190  190  190  190
##  [589]  190  190  190  190  190  190  190  190  190  456  456    9  456  456
##  [603]  456  456  456  456  456  456  456  456  456  456  456  190  190  456
##  [617]  456  456  456  456    9  456    9    9    9    9    9    9    9  456
##  [631]  456  190  190  456  190  190  190  456  190  190  456  190  190  190
##  [645]  190  190  190  190  190  190  190  190  190  190  445  445  445  190
##  [659]  445  445  445  445  445  445  190  190  190  190  190  190  445  445
##  [673]  445   15   15   15   15   15   15   15   15   15   15   15   15   15
##  [687]   15   15   53   12   12   12   12   12   12   12   12   12   12  506
##  [701]  506  506   12   12  506  506   53   53  506  506  506  506  506  506
##  [715]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
##  [729]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
##  [743]  506  506  506  506  506  506  506  506   53   53   53   14   53   53
##  [757]   53   53  506   53   53   53   53   53  506   53   53   53   53   53
##  [771]   53   53  506   53   53   53   53   53  506  506  506   53   53  506
##  [785]   53  506   53   53   53   53   53  506   53   53   53   53  506  506
##  [799]   53  506  506  506  506  506   14  506  506   53   53  506  506  506
##  [813]   53  506   53   53  506   53  506   53  506   53  506  506   53  506
##  [827]   53  506  506  506  506   53  506  506  506   53  506  506  506  506
##  [841]  506  152   53   14   14   14  506  506  506  506  506  506  506  506
##  [855]   14  506  506  506  506  152  506  506  506  506  506  506   14  506
##  [869]   14  506   14   14  152  506   14  152  506  506   14   14  506  506
##  [883]   14  506  506  506  506  506  506  506  506  506  152  506  506  506
##  [897]  506  506  152  506  506  506  506  506  506  506  506  506  506  506
##  [911]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
##  [925]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
##  [939]  506  506  506  506  506  506  506    3    3    3  506  506  152  506
##  [953]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
##  [967]  506  152  506  506  506  506  152  506  152  506  445  445  506  445
##  [981]  506  506  506  506  506  506  152  152  506  152  152  152  152  506
##  [995]  152  506  506  506  506  445  506  506  506  506  506  506  506  152
## [1009]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
## [1023]  506  152  506  506  506  506  506  152  506  506  152  506  506  152
## [1037]  506  506  506  506  506  506  506  506  506  506  506  506  506  506
## [1051]  506  506  506  506  506  152  506  506  506  506  506  506  506  506
## [1065]  506  506  152  506  506  506  506  506  506  445  506  445  506  506
## [1079]  506  506  506  152  152  152  152  152  152  506  506  506  152  506
## [1093]  506  152  152  152  506  152  152  152  506  152  506  506  152  506
## [1107]  152  506  506  506  506  506  506  152  506  445  445  445  152  152
## [1121]  152  506  152  506  152  152  506  506  152  506  506  152  152  152
## [1135]  506  152  506  506  152  506  152  506  506  506  152  152  152  506
## [1149]  445  152  506  152  506  445  506  445  152  506  506  506  445  506
## [1163]  152  152  152  152  152  445  506  506  506  445  506  152  506  152
## [1177]  445  506  506  152  506  506  506  445  506  506  506  152  506  506
## [1191]  506  445  152  152  506  445  506  152  506  152  152  506  506  445
## [1205]  445  445  506  506  506  506  506  445  152  152  445  445  152  445
## [1219]  152  445  445  152  152  445  152  445  152  152  445  152  152  152
## [1233]  152  152  152  152  445  152  152  152  152  152  152  506  506  152
## [1247]  152  152  506  506  152  152  445  152  445  152  445  506  506  152
## [1261]  152  152  445  506  506  152  445  152  152  506  506  445  506  152
## [1275]  152  445  506  152  506  506  506  506  445  506  152  506  506  506
## [1289]  506  506  152  506  506  152  445  445  506  506  506  506  445  152
## [1303]  506  445  445  506  445  445  445  445  445  445  445  152  445  445
## [1317]  506  506  445  506  506  506  506  445  152  152  506  445  506  152
## [1331]  506  445  152  506  506  506  506  506  445  445  445  152  506  506
## [1345]  506  445  445  445  506  152  506  506  445  445  152  152  506  506
## [1359]  152  506  506  506  506  506  152  152  506  506  152  445  506  152
## [1373]  152  506  152  506  445  152  152  152  506  506  152  506  152  506
## [1387]  506  445  152  445  506  506  506  445  152  506  506  445  445  445
## [1401]  445  152  445  445  445  445  445  445  152  506  152  506  445  445
## [1415]  445  445  152  152  152  445  445  445  506  506  152  445  506  152
## [1429]  445  506  506  445  152  445  152  445  445  506  445  506  506  445
## [1443]  506  445  506  506  506  445  506  152  445  445  445  445  445  152
## [1457]  445  506  445  506  506  506  506  445  506  445  506  506  506  445
## [1471]  152  445  506  445  506  445  506  506  445  445  445  445  445    1
## [1485]  445  506  506  506  506  506  445  445  506  445  445  445  506  506
## [1499]  445  506  506  445  445  445  445  506  445  445  506  506  506  506
## [1513]  445  506  506  506  506  506  445  506  445  506  506  506  506  506
## [1527]  506  445  506  445  506  506  506  506  506  506  506  506  506  506
## [1541]  506  445  445  506  445  445  445  445  445  445  445  445  506  506
## [1555]  445  445  445  445  445  445  445  445  445  445  445  445  445  506
## [1569]  445  445  445  445  445  445  445  445  445  445  445  445  445  445
## [1583]  445  506  445  445  445  445  445  506  445  445  445   82  445  445
## [1597]  445  506  506  445  445  445  445  506  445  445  506  506  445  445
## [1611]  506  445  445  445  445  445  506  506  445  445  506  445  445  445
## [1625]  445  445  445  445  445  445  445  506   82  445  445   82  445  506
## [1639]  445   82  445  445  445  445  506   82  445   82  445  445   82  445
## [1653]  445  445  445  445  445  445   82   82   82   82   82   82   73   82
## [1667]   82   82   82  506  445   82  445  445  445  445   73   82   82  445
## [1681]  445  445  445  445  445  445  445  445  445  445  445   73   82   73
## [1695]   82   82  506  506   82  445   82   82   73   73  506   73   82   82
## [1709]   82   82   82   82   73   82  445   73   73  445   82  445   82  445
## [1723]   82   82   82   82   82  445  445   82   82   82   73   82  445  445
## [1737]   82   82   82   82   82   82  445   73   82   82   82   82   82   73
## [1751]   82   82   73  445   82   73   73   73   82   73   82   73  445   73
## [1765]   82  445   73   73  445  445  445   73  445   82   73   73  445  445
## [1779]  445  445   73  445   73   73  445   73   73   73  445  445  445  445
## [1793]   73  445  445  445   73  445   73  445   82   73  445  445   73  445
## [1807]  445  445  445  445   73  445   73   73  445  445   73  445  445  445
## [1821]  445   73   73   73  445   73  445  445  445  445  445   82   73   73
## [1835]  445   82  445   82  445  445   73  445  445  445  445  445   82  445
## [1849]   73   73   73   82   73  445   82  445   73   82  445  445   82   73
## [1863]   44  445  445  445   44   44   44   44   44  445   82  445  445   73
## [1877]  445   82  445  445  445   73  445  445  445  445   73  445   82  445
## [1891]   82  445   73  445   82   73   82   82   82   82  445  445  445   73
## [1905]  445   44  445   44  445  445  445  445  445    7   73   73   73  445
## [1919]  445   73  445  445  445  445  445   73    7   73   82   82   35  445
## [1933]   73  445   73  445  445   35  445  445  411  411  445  445    7  445
## [1947]    7  445  445    7   35  445   44  445  445  445  445    7  445  445
## [1961]  445  445  411  445  445   35  445  445  445  445  445   73  411  445
## [1975]    7  445  411  445  445   73  411  411  445  411  411  445  411   73
## [1989]  411  445   73   73   35  411   35  411   35  411  411  411  445  445
## [2003]  411  445  411  411  411  411  411  445  411  411  445  411  577  445
## [2017]  445  445  411  411  411  445  411  411  411  445  445   73   35  411
## [2031]  445  411  445  411  411  577   35  411  411  411  411  411  411  411
## [2045]  411  445  411  445  411  411  577  411  411  411  411  411  445  411
## [2059]  411   35  411  411  411  411  411  411  445  577  445  411  577  445
## [2073]  445  411  445  577  577  577  445  411  445  445  577   35  411  411
## [2087]   35  411  411   35  411  411  577  445  577  577  411  445  411   35
## [2101]  411   35  411   35  411  411  411  411  577  411  577  577  577  577
## [2115]  411  577  445  577  445  411  445  411  445  577  411  577  445  411
## [2129]  577  411  411  577  411  411  577  411  411  411   44   44  577  411
## [2143]  577  411  577  411  411  411  411  445  577  577   35  577  411  577
## [2157]  577  577  577  411  411  411  411  411  411  411  577  411  411  411
## [2171]  411  577  577  577  577  577  577   35   35  411  577  411  411  411
## [2185]   35  411  577  411  411  411  577  577  411  577  577  577  411  411
## [2199]  577  411  411  411  577   35  411  411  411  577  577  411  411   35
## [2213]   35  411   35   35  411  411  577  411  577   35  411   35  577  411
## [2227]  411  577  577   35  577  577  577  411  411  411  411  411  411  411
## [2241]  411  411  411  411   35  411  445  445  577  445  411  577  577  445
## [2255]  411  411  577  411  577  411  577  411  411  445  577  411  577  577
## [2269]  577  577  577  577  577  411   35  577  577  577  411  411  577  411
## [2283]   35  577  577  577  577   35   35  577  577  577  411  577  577  577
## [2297]  577  411  577  411  577  411  577  411  411  577  577  411  577  577
## [2311]  577  411  577  577  577  411  411  411  411  577  411  577  445  411
## [2325]  411  577  577  577  411  411  411  577  577  411  411  411  577  411
## [2339]  577  577  411  577  577  577  577  577  577  577  411  577   35  577
## [2353]   35  577  411  577  577  445  411  445  445  445  411  577  445  411
## [2367]  577  411   44  411  577  577  577  577  411  411  577  577  577  577
## [2381]  577  149  577  577   44  411   44  149   44  577  149   44  149  411
## [2395]  577  577  577  411  411  411  411  577  577   44  577  577  411  577
## [2409]  411  411  149  411  411  577  411  411  577  577  149  577  411  577
## [2423]  577  577  411  411  577  577  577  577  577  577  577  149  411    8
## [2437]  411  577  411  411  411  411  411  577  411  411  411  411  411  411
## [2451]  411  411  577  411  577  577  411  149  577  577  411  411  411  577
## [2465]  411  411  411  411  577  411  577    8    8  577   44  577  577  577
## [2479]  411  411  411  577  577  577  411  577  411  577  411  577  577  577
## [2493]  577  577   44  577  411  577    8   44   44  577  411  577   44  411
## [2507]  577  411  411  577  411  411   44   44  411  577  411  577  577  577
## [2521]  577  577  411  577  411  577   44  411   44  411  411  577   44  577
## [2535]  577  411  411   44  411  577  411  411   44   44   44   44  577  149
## [2549]  149  411   44  411  577  411  577  411  411  577  411  411  577   44
## [2563]  577   44  411  577   44   44  411  577  577  411  411  411  411  411
## [2577]  577  577  577  577   44   44  577  411  577  577  411  411  411  577
## [2591]  411  577  411  577  411  411  411  411  577   44  411   44  577  577
## [2605]   44  577  577  577  577  577  577  411  577  577  411  411  577  577
## [2619]  577  411  577  577  577  577  577  577  577  577  411  577  577  411
## [2633]  577  577  577  577  577  411  577  411  149  577  149  149  411  577
## [2647]  149  411  577  411  411  577  411  577  149  577  411  577  577  411
## [2661]   44  411  577  411  411  411  577  411  411  577  577  577  577  411
## [2675]  577  149  149  577  577  411  411  577  411  577  577  577  577  577
## [2689]  577  577  577  577  411  411  577  577  577  411  577  577  411  411
## [2703]  577  577  577  577  149  577  577  577  411  149   50  411  577   50
## [2717]  411  411  577  411   50   50  411  577  577  577  411  149  411  411
## [2731]  411  411  411  577   50  411  577  411  411  411  411  149  411  411
## [2745]  577  411  411  577  577  411  577  577  577  577  577  577  577  577
## [2759]  577  411  577  577  577  411  411  577  577  577  411  577  577  577
## [2773]  577  411  577  577  577  577  577  411  577  577    8  411  577  411
## [2787]  577  577  577  411  411  577  577  577  577  577  577  411  577  411
## [2801]  577  577  411  411  577  411    8  577  577  411    8  577    8  577
## [2815]  577  577   50  411  577  577   68  577  577  577  577  577  577  577
## [2829]  577  577  411   68   50   50  577   50  411  577  577  149  149  411
## [2843]  149  411   50  411   50  577   50  411  577  411  577  577  577  577
## [2857]  577  149  411  577  149  411   50  577   50   50   50  411  577  577
## [2871]  149  577  149  411  411  149  149  149  411  411  411  149  149  149
## [2885]  577  411  577   50   50  149   50  149  411  577  577  577  577  411
## [2899]  577  411  577  577  411  577   68  577  577  577  411  577  577  577
## [2913]  577  411  411  411  411  577  577  577  411   68   68   68  577  577
## [2927]  411  577   68  411  411  411  577   68  411  577  411  411  411  411
## [2941]  411  577  411  577  577  149   50  577  149  577  577  411  577  411
## [2955]  577  577  577  411  577  149   68  577   68  577  149   68  411    9
## [2969]  149  149   68  577  577  411  577  577    9  577  411  411  411  577
## [2983]  411  577  577  149   13  411  577  149  411  577   68   68   68  411
## [2997]  577  577  149   68   68  411   68  577   68  577  577   68  577  577
## [3011]  577  577   50   68    4  577  577  577  577  577  577   68  577  577
## [3025]  577  149  149  577  577  411  577  149  149  577  577  577  577  149
## [3039]  577  411  577  149  149  577  149  149  149  149  411  577  411  577
## [3053]  577  577  577  577  577  149  577  411  577   68  138  577  577   68
## [3067]  577  577   68   68    4  149  577  149   68   68  149  411   68   68
## [3081]   68  577   13    4   68    9   68  149  577  411  149   68  411    4
## [3095]  138  149  411  577  138  577   68  149  577  577  577  411  149  577
## [3109]  149  577  577  577  577  577  577  411  577  149  577  149    9  149
## [3123]  149   68  138  411  577  577  577    9  138  577  149    9  149   68
## [3137]  149   68  577  149  149  577  577  149  138  138  577  149   68  577
## [3151]  149  138  577  149  577  138  577  149    9  577  577  577  577  149
## [3165]  577  138  149  149  149  149  577  149  149  149  138  149  149  577
## [3179]  149  149  138  577  149  149  577  577   13  138   68  149  149  149
## [3193]   50   50   50  149  577  577  577    9   13  149  577   68  149  138
## [3207]  577   68  149  577  577  149  577  577  149  577  577  149  149  149
## [3221]  577  577  149  149   13   13   68  149   13  577  149  577  577  149
## [3235]  149  138  138  149    9  577  149  149  577  577  138  577  577  149
## [3249]  577  138  577  577  577  149  149  149  149  149  149  149  138   68
## [3263]  149  149  149  149  149  149  149  149   13  149  149  577  149  149
## [3277]   13   13  149  149  149  577  149  149  149  149  577  577  577   50
## [3291]  577   23  577   68  577  577   68  149  138  138   68  577  149   68
## [3305]  138    9   68  149  138  577  149  577  577  138   23  577  577  149
## [3319]  577  577  138  577  577  149  138  138   25  138  138  149  138  577
## [3333]  577   25  577   25   25   50  138   50  577   68  577  577  577  577
## [3347]   68  577  577    9  577  577  577  138   50   68  577  577   68  577
## [3361]   68   23   25  577    9   50   50   23   50   50   23   68  138  138
## [3375]  577   50  577   25   23   68   50  577  577  577  138  577  577  577
## [3389]  138  577  138   13  138   50   50  577  577   25  577  138   50   50
## [3403]  577   50   50   50   50   50   25   68  577  138   13  577   25   68
## [3417]  577  577  577  138   23   23  138  138   68  138   68   68    9   68
## [3431]    9   68   23   68    9  138  138  138  138  138    9  138   68   68
## [3445]  138  138  138   68  138  138   25   50  138  138   68  138  138    9
## [3459]  138   13  138   50   23  138   25  138  138  138   23   25   23    3
## [3473]   23   44  138    9  138  138  138   23  138  138  138   23   25  138
## [3487]  138   23   25   23  138  138  138  138    3   25    3  138  138  138
## [3501]  138  138    5  138  138  138  138    5  138    5  138   25  138  138
## [3515]  138  138  138    5  138  138  138  138  138  138    5   23   25   25
## [3529]   50  138   25    5  138    5  138  138  138   23   50   23  138  133
## [3543]    5   23  138  138   25    5  138   11   50   11   25  138  138    4
## [3557]  138  138  138  138    1  138  138   15  138  133   15  133  133  138
## [3571]  133   25   25  138  138  133  133   11   11   32   11   11  138  138
## [3585]   11  138    4   32   23   11   23   11   23   23   11   32    9  133
## [3599]   15  133    5   25  133  133  138  138  138  138   32  138   50   50
## [3613]   25   32  138   15  133  138  133   11  133  138  133   32  138   23
## [3627]  133   23   23  133  133   23  133   23  133  138   23   26  133   26
## [3641]   23  138  138  133  133  138   26   26  133  138   26   15   26  133
## [3655]   26   26   23  133  133   26  133  133  133  133  133  133  133  133
## [3669]   23   23   23   23   23  133   23   23   23  133   32   23   23  133
## [3683]   23  133   32  133   23   15  133   23   15   15  133   32  133    4
## [3697]  133  133   32   15   15  133  133  133  133   15  133  133  133  133
## [3711]   26    9   32    9   26    9  133   26   26   32  133  133  133   32
## [3725]   32   32  133  133   32  133   32  133   15   15   15   15  133  133
## [3739]  133   32   32   26  133  133  133  133  133   26   32   32   26   32
## [3753]   32  133  133   26    9  133  133   26  133  133  133  133  133  133
## [3767]  133  133  133  133  133  133  133  133  133  133  133  133  133  133
## [3781]  133  102  102  133  133    9    9  102  133  133  102  133  133  133
## [3795]   26  133  102  102  102  133  102    9  133    4  102  133   26  133
## [3809]   26  133  102  133  133   26  133  133  102   32   32   32   32  133
## [3823]   26  133  133  102  133   26   26  133   26  133  133  133  102  102
## [3837]  133  133  133  133  102    9   32  102  133  133  133  133  133  133
## [3851]  133  133  102   32  133  102  133  102   32  102  102  102   32  102
## [3865]  102  102  102  102  102  102  102  102  102  102  102  102  102  102
## [3879]  102   32  102  102  102  102  102  102  102  102  102  102  102  102
## [3893]  102  102  102  102  102  102  102  102  102  102  102  102  102    3
## [3907]    3    3   88   88   88   88   88   88   88   88   88   88   88   88
## [3921]   88   88   88  102   88   88   88   88   88   88   88   88   88   88
## [3935]   88   88   88   88   88   88   88   88   88   88   88   88   88   88
## [3949]   88   88   88   88   88   88   88   88   88  102   88   88   88   88
## [3963]  102   88   88   88   88   88   88   88  102   88   88   88   88   88
## [3977]   88   88   88   88   88   88   88   88   88   88   88   88   88   88
## [3991]   88   88   88   88   88   88   88   88   88   88  102  102  102  102
## [4005]  102  102  102  102  102  102  102  102  102  102  102  102  102  102
## [4019]  102   14   14  102   14   14   14  102   14   14   14   14   14   14
## [4033]   14   14   14  102  102  102  102  102  102  102  102  102  102  102
## [4047]  102  102  102  445  445  445  445  445  445  445  445  445  445  445
## [4061]  445   17 1296 1296 1296 1296   17   17   17   17   17   17   17   17
## [4075]   17   17   17   17   17   17 1296 1296 1296 1296 1296 1296 1296 1296
## [4089] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4103] 1296 1296 1296 1296   17   17 1296 1296 1296 1296 1296 1296 1296 1296
## [4117] 1296 1296 1296 1296 1296 1296 1296 1296   16   16   16   16   16   16
## [4131]   16   16   16   16   16   16 1296   16   16 1296 1296   16 1296 1296
## [4145]   16 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4159] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4173] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4187] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4201] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4215] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4229] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4243] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4257] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4271] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4285] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4299] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4313] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4327] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4341] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4355] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4369] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4383] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4397] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4411] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4425] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4439] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4453] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4467] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4481] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4495] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4509] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4523] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4537] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4551] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4565] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4579] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4593] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4607] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4621] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4635] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4649] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4663] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4677] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4691] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4705] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4719] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4733] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4747] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4761] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4775] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4789] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4803] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4817] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4831] 1296   70 1296   70   70   70 1296 1296   70 1296 1296 1296 1296 1296
## [4845] 1296 1296 1296 1296 1296 1296   70 1296 1296 1296 1296 1296 1296 1296
## [4859] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4873] 1296   70 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4887] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4901] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4915] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4929] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4943] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4957] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4971] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4985] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [4999] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5013] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5027] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5041] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5055] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5069] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5083] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5097] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5111] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5125] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5139] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5153] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5167] 1296 1296 1296 1296 1296 1296 1296 1296   61 1296 1296   61 1296 1296
## [5181] 1296 1296 1296 1296 1296 1296 1296   61   61 1296   61 1296 1296   61
## [5195]   61   61   61   61   61   61   61 1296 1296 1296 1296   61 1296 1296
## [5209] 1296 1296   61   61 1296 1296   61   61 1296 1296 1296   61 1296   61
## [5223] 1296 1296   61 1296   61 1296   61 1296 1296   61   61 1296 1296 1296
## [5237] 1296 1296 1296   61 1296 1296 1296 1296 1296   61 1296 1296 1296 1296
## [5251]   61   61   61 1296 1296 1296   61 1296 1296   61 1296 1296 1296 1296
## [5265]   61 1296 1296 1296 1296 1296   61 1296 1296 1296   61 1296 1296   61
## [5279] 1296   61 1296 1296 1296 1296 1296 1296 1296 1296   61   61   61   61
## [5293] 1296 1296 1296 1296 1296   61   61 1296 1296 1296 1296 1296   61 1296
## [5307]   61   61   61 1296 1296   61   61   61   61   61   61 1296   61   61
## [5321] 1296 1296   61 1296 1296 1296 1296 1296   61 1296 1296 1296 1296 1296
## [5335] 1296 1296   61 1296 1296 1296   61 1296 1296 1296   61 1296   61 1296
## [5349] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5363] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5377] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5391] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5405] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5419] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5433] 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296 1296
## [5447] 1296 1296 1296 1296 1296 1296 1296 1296   70   70 1296   70 1296 1296
## [5461] 1296   70   70   70   70   70   70   70   70   70   70   70   70   70
## [5475]   70   70   70   70   70   70   70   70   70   70   70   70   70   70
## [5489]   70   70   70   70   70   70   70   70   70   70   70   70   70   70
## [5503]   70   70   70   70   70   70   70   70   70   70   70   70   70   70
## [5517]   70   70   70   70   70
countryname_s_e=ifelse( merged_mr$country %in% countrywithppm[countrywithppm %in% merged_mr$country], paste(countryname, "*", sep = ""), countryname)

mergedmedian$countryfullname = paste0(countryname_s_e," (",merged_mr$count1 ,")")

bp2 <- ggplot(mergedmedian, aes(x=countryfullname, y=value_mean, group=country)) +
  labs(x = "Country", y = expression(paste("NO"[2], "  ", mu, "g/", "m"^3)), cex = 1.5) +
  geom_boxplot(aes(fill = median)) + 
  theme(text = element_text(size = 13), axis.text.x = element_text(angle = 90, hjust = 1)) +   scale_fill_distiller(palette = "Spectral")
#   scale_color_brewer(direction = 1)
 
print(bp2 + ylim(0, 100))

ggsave("boxplot.png")

Plot the paired correlation, for road predictors, population, Tropomi. For DE, CN, and world

merged_mr %>% na.omit %>% filter(country == "DE") %>%  ungroup() %>%dplyr::select(matches("value_mean|road_|night|trop")) %>% cor %>% corrplot(type = "upper", method = "pie", tl.cex = 0.7)

merged_mr %>% na.omit %>% filter(country == "US") %>%  ungroup() %>%dplyr::select(matches("value_mean|road_|night|trop")) %>% cor %>% corrplot(type = "upper", method = "pie", tl.cex = 0.7)

merged_mr %>% na.omit %>% filter(country == "CA") %>%  ungroup() %>%dplyr::select(matches("value_mean|road_|night|trop")) %>% cor %>% corrplot(type = "upper", method = "pie", tl.cex = 0.7)

merged_mr %>% na.omit %>% filter(country == "CN") %>%  ungroup() %>%dplyr::select(matches("value_mean|road_|night|trop")) %>% cor %>% corrplot(type = "upper", method = "pie", tl.cex = 0.7)

merged_mr %>% na.omit  %>%  ungroup() %>%dplyr::select(matches("value_mean|road_|night|trop")) %>% cor %>% corrplot(type = "upper", method = "pie", tl.cex = 0.7) 

Spatial dependency

grd_sp <- as_Spatial(locations_sf)
 
dt.vgm = variogram(value_mean~1, grd_sp, cutoff=   1)
plot(dt.vgm)

#Moran I test
#install.packages("ape", dependencies = TRUE)
#library(ape)

#merged_mrf =  merged_mr%>%filter(country == "US")
#no2.dists <- as.matrix(dist(cbind(merged_mrf$LONGITUDE, merged_mrf$LATITUDE)))
#no2.dists[1:5, 1:5]
#no2.inv <- 1/no2.dists
#diag(no2.inv) <- 0
#no2.inv[1:5, 1:5]
#Moran.I(merged_mrf$value_mean, na.rm = T, no2.inv) 
countryvariogram = function(COUN, cutoff){
loca =  locations_sf%>%filter(country == COUN)
grd_sp <- as_Spatial(loca)

dt.vgm = variogram(value_mean~1, grd_sp, cutoff = cutoff)
plot(dt.vgm)
}
countryvariogram("DE", 1)

countryvariogram("US", 1)

countryvariogram("CN", 1) # reason?

For the illustration purpose, we simply remove missing data, there are not many, 41. In practice, careful handling is needed to choose between removing missing data, imputation or handle them explicitly in the model.

inde_var = merged_mr %>%na.omit() # 70 variables
names(inde_var) 
##  [1] "road_class_M345_1000" "road_class_M345_100"  "road_class_M345_25"  
##  [4] "road_class_M345_3000" "road_class_M345_300"  "road_class_M345_5000"
##  [7] "road_class_M345_500"  "road_class_M345_50"   "road_class_M345_800" 
## [10] "long"                 "lat"                  "nightlight_800"      
## [13] "nightlight_500"       "nightlight_5000"      "nightlight_3000"     
## [16] "nightlight_1000"      "elevation"            "industry_1000"       
## [19] "industry_100"         "industry_25"          "industry_3000"       
## [22] "industry_300"         "industry_5000"        "industry_500"        
## [25] "industry_50"          "industry_800"         "OMI_mean_filt"       
## [28] "population_1000"      "population_3000"      "population_5000"     
## [31] "road_class_1_1000"    "road_class_1_100"     "road_class_1_25"     
## [34] "road_class_1_3000"    "road_class_1_300"     "road_class_1_5000"   
## [37] "road_class_1_500"     "road_class_1_50"      "road_class_1_800"    
## [40] "road_class_2_1000"    "road_class_2_100"     "road_class_2_25"     
## [43] "road_class_2_3000"    "road_class_2_300"     "road_class_2_5000"   
## [46] "road_class_2_500"     "road_class_2_50"      "road_class_2_800"    
## [49] "temperature_2m_10"    "temperature_2m_11"    "temperature_2m_12"   
## [52] "temperature_2m_1"     "temperature_2m_2"     "temperature_2m_3"    
## [55] "temperature_2m_4"     "temperature_2m_5"     "temperature_2m_6"    
## [58] "temperature_2m_7"     "temperature_2m_8"     "temperature_2m_9"    
## [61] "trop_mean_filt"       "wind_speed_10m_10"    "wind_speed_10m_11"   
## [64] "wind_speed_10m_12"    "wind_speed_10m_1"     "wind_speed_10m_2"    
## [67] "wind_speed_10m_3"     "wind_speed_10m_4"     "wind_speed_10m_5"    
## [70] "wind_speed_10m_6"     "wind_speed_10m_7"     "wind_speed_10m_8"    
## [73] "wind_speed_10m_9"     "value_mean"           "country"             
## [76] "count1"

Scatterplot

The scatter plots between predictors and mean NO2, Germany and global datasets. loess: moving regression, non-parametric, each smoothed value is given by a weighted linear least squares regression over the span.

inde_var %>% filter(country=="DE")%>%ungroup%>% dplyr::select(matches("road_class_M345|nightlight|temperature_2m_7|value")) %>% scatterplot(y_name = "value_mean", fittingmethod = "loess")

inde_var %>% filter(country=="DE")%>%ungroup%>% dplyr::select(matches("road_class_2|value")) %>% scatterplot(y_name = "value_mean", fittingmethod = "gam")

inde_var %>%ungroup%>% dplyr::select(matches("road_class_M345|nightlight|temperature_2m_7|value")) %>% scatterplot(y_name = "value_mean", fittingmethod = "loess")

inde_var %>% ungroup%>%  dplyr::select(matches("road_class_2|value")) %>% scatterplot(y_name = "value_mean", fittingmethod = "loess")

inde_var %>% ungroup%>% dplyr::select(matches("road_class_1|value")) %>% scatterplot(y_name = "value_mean", fittingmethod = "gam")

#inde_var %>% dplyr::select(matches("Tro|OMI|Rsp|night_value")) %>% scatterplot(y_name = "night_value", fittingmethod = "loess")
# why?

# can choose any variable to look at the scatterplot

#inde_var %>% dplyr::select(matches("ROAD_1|day_value")) %>% scatterplot(y_name = "day_value", fittingmethod = "gam")

Discussion 1, try different scatterplot and discuss about the findings

Modelling

If we use a single regression tree, the tree and prediction error will be different if shuffeling training and testing data. To reduce the variance, a method called “bagging” is developed, which agregate over (weak) voters. If the predictor variables are correlated, the reduction in variance is decreasing, Random Forest comes to resecue by ramdomly sampling variables.

XGBoost is surely popular, as it won the Kaggle (machine learning prediction) competition. It has a few features such as being scalable, but most importantly, it super-imposed a regularization to prevent model over-fitting. However, in my experience, though I often obtain the lowest RMSE or highest R2 with XGBoost, the spatial pattern it predicts look a lot worse than random forest, or simpler linear model with regularization – Lasso. It looks to predict lots of artefacts. We further compared various model prediction results with mobile sensor measurements (with a typical Dutch transporting tool “bakfiets”, which is a cargo-bike), and found XGBoost matches the least with the mobile sensor measurements! – No matter how I tune the model.

inde_var = inde_var%>%ungroup()%>%dplyr::select(-country)
str(inde_var)
## tibble [5,479 × 75] (S3: tbl_df/tbl/data.frame)
##  $ road_class_M345_1000: num [1:5479] 3178 35844 24580 17182 26042 ...
##  $ road_class_M345_100 : num [1:5479] 85.3 335.6 531.3 246.8 25 ...
##  $ road_class_M345_25  : num [1:5479] 0 0 90.4 13.2 0 ...
##  $ road_class_M345_3000: num [1:5479] 55187 192916 95691 28343 104357 ...
##  $ road_class_M345_300 : num [1:5479] 950 3804 4512 2504 2629 ...
##  $ road_class_M345_5000: num [1:5479] 85594 356023 158929 30837 148055 ...
##  $ road_class_M345_500 : num [1:5479] 1982 10701 11805 7304 8104 ...
##  $ road_class_M345_50  : num [1:5479] 0 17.5 179 104 0 ...
##  $ road_class_M345_800 : num [1:5479] 2598 25730 20351 12826 18050 ...
##  $ long                : num [1:5479] -159 -148 -135 -134 -129 ...
##  $ lat                 : num [1:5479] 21.9 64.8 60.7 68.4 54.5 ...
##  $ nightlight_800      : num [1:5479] 3.53 163.67 70.38 25.78 20.06 ...
##  $ nightlight_500      : num [1:5479] 2.81 158.01 80.8 32.05 19.36 ...
##  $ nightlight_5000     : num [1:5479] 2.01 72.71 14.92 1.88 4.25 ...
##  $ nightlight_3000     : num [1:5479] 3.2 108.04 28.4 4.96 9.51 ...
##  $ nightlight_1000     : num [1:5479] 3.73 165.51 62.6 22.04 19.92 ...
##  $ elevation           : num [1:5479] 22 133.7 638.6 27.1 71.4 ...
##  $ industry_1000       : num [1:5479] 0 23460 0 14790 47940 ...
##  $ industry_100        : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ industry_25         : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ industry_3000       : num [1:5479] 0 114750 37740 99705 378420 ...
##  $ industry_300        : num [1:5479] 0 510 0 0 0 0 0 0 0 0 ...
##  $ industry_5000       : num [1:5479] 0 880005 37740 139230 378420 ...
##  $ industry_500        : num [1:5479] 0 7650 0 0 0 0 0 0 0 0 ...
##  $ industry_50         : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ industry_800        : num [1:5479] 0 23460 0 14535 7905 ...
##  $ OMI_mean_filt       : num [1:5479] 4.41e+14 6.72e+14 2.65e+14 3.15e+14 4.38e+14 ...
##  $ population_1000     : num [1:5479] 26126 181457 132975 98610 241644 ...
##  $ population_3000     : num [1:5479] 821134 1342751 689439 163661 839422 ...
##  $ population_5000     : num [1:5479] 1521328 1995845 1103387 163872 1030037 ...
##  $ road_class_1_1000   : num [1:5479] 0 2532 0 0 3443 ...
##  $ road_class_1_100    : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_1_25     : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_1_3000   : num [1:5479] 3210 46428 6830 0 9122 ...
##  $ road_class_1_300    : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_1_5000   : num [1:5479] 8241 77651 12222 0 15208 ...
##  $ road_class_1_500    : num [1:5479] 0 0 0 0 0 ...
##  $ road_class_1_50     : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_1_800    : num [1:5479] 0 143 0 0 2159 ...
##  $ road_class_2_1000   : num [1:5479] 0 4501 0 1712 0 ...
##  $ road_class_2_100    : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_2_25     : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_2_3000   : num [1:5479] 438 13162 0 7142 0 ...
##  $ road_class_2_300    : num [1:5479] 0 552 0 0 0 ...
##  $ road_class_2_5000   : num [1:5479] 4588 25168 0 11298 4157 ...
##  $ road_class_2_500    : num [1:5479] 0 1915 0 0 0 ...
##  $ road_class_2_50     : num [1:5479] 0 0 0 0 0 0 0 0 0 0 ...
##  $ road_class_2_800    : num [1:5479] 0 3565 0 1128 0 ...
##  $ temperature_2m_10   : num [1:5479] 23.623 0.379 -0.335 -3.538 4.551 ...
##  $ temperature_2m_11   : num [1:5479] 22.28 -13.15 -15.93 -17.43 -1.07 ...
##  $ temperature_2m_12   : num [1:5479] 20.52 -11.79 -15.41 -16.84 -5.06 ...
##  $ temperature_2m_1    : num [1:5479] 20.82 -21.27 -17.52 -19.63 -5.16 ...
##  $ temperature_2m_2    : num [1:5479] 20.88 -14.83 -15.18 -22.57 -6.11 ...
##  $ temperature_2m_3    : num [1:5479] 21.88 -17.19 -14.33 -20.37 -1.36 ...
##  $ temperature_2m_4    : num [1:5479] 22.468 1.433 -0.347 -11.176 4.933 ...
##  $ temperature_2m_5    : num [1:5479] 23.06 10.25 4.82 3.63 8.86 ...
##  $ temperature_2m_6    : num [1:5479] 24 16.9 10.5 10.5 12.1 ...
##  $ temperature_2m_7    : num [1:5479] 24.4 19.3 12.2 15.2 13.9 ...
##  $ temperature_2m_8    : num [1:5479] 24.9 14.8 12 13.6 16.1 ...
##  $ temperature_2m_9    : num [1:5479] 25.07 9.22 6.86 5.82 12.4 ...
##  $ trop_mean_filt      : num [1:5479] 35 109.4 36.5 18.5 33.3 ...
##  $ wind_speed_10m_10   : num [1:5479] 3.407 1.596 1.532 1.951 0.803 ...
##  $ wind_speed_10m_11   : num [1:5479] 4.07 1.709 1.359 2.156 0.957 ...
##  $ wind_speed_10m_12   : num [1:5479] 3.166 1.89 1.578 2.132 0.782 ...
##  $ wind_speed_10m_1    : num [1:5479] 3.197 1.759 1.602 2.429 0.864 ...
##  $ wind_speed_10m_2    : num [1:5479] 3.401 1.729 1.37 2.227 0.936 ...
##  $ wind_speed_10m_3    : num [1:5479] 2.81 1.56 1.46 2.34 1.03 ...
##  $ wind_speed_10m_4    : num [1:5479] 3.413 2.233 1.479 2.365 0.841 ...
##  $ wind_speed_10m_5    : num [1:5479] 3.298 1.623 1.909 2.405 0.921 ...
##  $ wind_speed_10m_6    : num [1:5479] 3.897 1.807 1.684 2.497 0.912 ...
##  $ wind_speed_10m_7    : num [1:5479] 3.991 1.472 1.466 2.828 0.842 ...
##  $ wind_speed_10m_8    : num [1:5479] 3.96 1.39 1.86 2.48 0.91 ...
##  $ wind_speed_10m_9    : num [1:5479] 3.006 1.708 1.635 2.326 0.735 ...
##  $ value_mean          : num [1:5479] 5.55 24.01 10.28 4.24 4.61 ...
##  $ count1              : int [1:5479] 456 456 190 190 190 190 190 190 190 456 ...

Hyperparameter tuning using R Caret package.

  • Here I only show how to tune hyper-parameters, detailed description and what hyper-parameter I tunned are in “Glo_hyp_tun.Rmb”
  • It is commonly (and strongly) recommended in deep learning to split the data into 3: model training; hyperparameter optimization; testing. The reason, is that the hyper-parameter optimization process may cause information leakage. However, separating a test dataset out may cause more severe bias. Actually, this question haunted me for a very long time since I started pulling out my results, and alsogenerated heated discussions between me and a reviewer of my recent global mapping paper. To reflect, [I wrote a discussion about this] (https://tomatofox.wordpress.com/2020/04/20/how-to-assess-accuracy-of-a-machine-learning-method-when-observations-are-limited/), comments appreciated!

Let’s come back to the model tunning, here I am showing to do this with the R package Caret, there are other packages,such as mlr3, but but Caret seems to be well-maintained, and is sufficient in most cases, and you can simply use ggplot on the caret::train object. All we need is to custom a tunning grid for tunning and control the resampling method.

Firstly, Random Forest, the most important hyperparameter are mtry - the number of variables sampled at each split, and min.node.size - the minimum number of observations at the leaf. The number of trees is also a hyperparameter, but it can be set as high as you like, as it will not cause model over-fitting as each tree is grown independently, which is different from boosting, which grows trees subsequently. Of course, you can also tune it.

[Try after the workshop as it takes quite a while; set the “eval =T”]

trl <- trainControl(method = "cv", number = 3) # control the resampling method
 tgrid <- expand.grid(
  .mtry = seq(7, 30, by = 3),
  .splitrule = "variance",
  .min.node.size = c(5, 10)
)
caret::train(value_mean ~ ., data =  inde_var , method='ranger', trControl  = trl, tuneGrid= tgrid) %>%ggplot()
#The final values used for the model were mtry = 25, splitrule = variance and min.node.size = 5.

In the same way, we can tune GBM [ Try after the workshop as it takes quite a while].

  • Note: we can use “bernoulli” for binary data and “gaussian” for continuous.

Tunning XGBoost is more complex as it has a lot more hyperparameters to tune: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

1 gamma[default=0][range: (0,Inf)] It controls regularization (or prevents overfitting). The optimal value of gamma depends on the data set and other parameter values. Higher the value, higher the regularization. Regularization means penalizing large coefficients which don’t improve the model’s performance. default = 0 means no regularization. Tune trick: Start with 0 and check CV error rate. If you see train error >> test error, bring gamma into action.

2. lambda and Alpha: similar to the Lasso Lambda, control the strength of regularization

3. nrounds[default=100] It controls the maximum number of iterations. For classification, it is similar to the number of trees to grow. Should be tuned using CV

4. eta[default=0.3][range: (0,1)] It controls the learning rate, i.e., the rate at which our model learns patterns in data. After every round, it shrinks the feature weights to reach the best optimum. Lower eta leads to slower computation. It must be supported by increase in nrounds. Typically, it lies between 0.01 - 0.3

5. max_depth[default=6][range: (0,Inf)] It controls the depth of the tree. Larger data sets require deep trees to learn the rules from data. Should be tuned using CV

xgboostgrid = expand.grid(nrounds = seq(300, 2000, by = 50), max_depth = 3:5, eta = seq(0.05, 0.2, by = 0.05),gamma =  1,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 0.7) 
trainControl <- trainControl(method="cv", number=5, savePredictions = "final", allowParallel = T) #5 - folds
# train the model
train(value_mean~., data=inde_var, method="xgbTree", trControl=trainControl, tuneGrid =xgboostgrid)%>%ggplot()

Models

Regression tree

If you train a single train, e can see the tree is stochastic. But we can look at the tree structure to get some intuition of the model structure.

for (i in 2:5)
{
  set.seed(i)
  ctree_LUR(inde_var, y_varname= c("value_mean"), training = training, test =  test, grepstring = vastring)
}
Random Forest

Creates diverse set of trees because 1) trees are unstable w.r.t. changes in learning/training data (bagging) 2) randomly preselect mtry splitting variables in each split

ranger(value_mean~., data = inde_var)
## Ranger result
## 
## Call:
##  ranger(value_mean ~ ., data = inde_var) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      5479 
## Number of independent variables:  74 
## Mtry:                             8 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       49.76662 
## R squared (OOB):                  0.7445118

Gradient boosting

Here I am showing the “gbm” package. The “dismo” package extends “gbm” with the deviviance calculated from a distribution that can be chosen by users. Note though the dismo is built on gbm functions, the hyperparameters are slightly different.

gbm1 =  gbm(formula = value_mean~., data =inde_var, distribution = "gaussian", n.trees = 2000,
  interaction.depth = 6, shrinkage = 0.01, bag.fraction = 0.5 )
gbm1
## gbm(formula = value_mean ~ ., distribution = "gaussian", data = inde_var, 
##     n.trees = 2000, interaction.depth = 6, shrinkage = 0.01, 
##     bag.fraction = 0.5)
## A gradient boosted model with gaussian loss function.
## 2000 iterations were performed.
## There were 74 predictors of which 74 had non-zero influence.
summary(gbm1)

##                                       var      rel.inf
## trop_mean_filt             trop_mean_filt 32.973294803
## population_5000           population_5000 13.312029081
## population_3000           population_3000  7.670744460
## population_1000           population_1000  5.969546780
## long                                 long  2.521301026
## road_class_2_25           road_class_2_25  2.023068945
## road_class_2_50           road_class_2_50  1.948853462
## count1                             count1  1.458454851
## wind_speed_10m_3         wind_speed_10m_3  1.363854300
## road_class_1_50           road_class_1_50  1.283268764
## road_class_2_100         road_class_2_100  1.272433848
## road_class_1_25           road_class_1_25  1.221385331
## road_class_M345_3000 road_class_M345_3000  1.202173226
## nightlight_500             nightlight_500  0.942776171
## road_class_M345_100   road_class_M345_100  0.907852355
## road_class_M345_300   road_class_M345_300  0.879410071
## OMI_mean_filt               OMI_mean_filt  0.856574644
## road_class_M345_1000 road_class_M345_1000  0.837801013
## road_class_1_300         road_class_1_300  0.808311841
## road_class_M345_25     road_class_M345_25  0.783951258
## road_class_1_5000       road_class_1_5000  0.776877157
## road_class_2_5000       road_class_2_5000  0.776751669
## road_class_M345_5000 road_class_M345_5000  0.748310885
## elevation                       elevation  0.745164502
## road_class_1_100         road_class_1_100  0.732694471
## road_class_M345_50     road_class_M345_50  0.682614576
## wind_speed_10m_2         wind_speed_10m_2  0.678057838
## lat                                   lat  0.662619722
## road_class_2_3000       road_class_2_3000  0.659622742
## road_class_M345_500   road_class_M345_500  0.630982188
## road_class_M345_800   road_class_M345_800  0.626380098
## temperature_2m_10       temperature_2m_10  0.584711991
## temperature_2m_7         temperature_2m_7  0.529281613
## road_class_1_500         road_class_1_500  0.486717225
## temperature_2m_1         temperature_2m_1  0.476933371
## temperature_2m_12       temperature_2m_12  0.454651945
## industry_5000               industry_5000  0.450347550
## wind_speed_10m_1         wind_speed_10m_1  0.441206426
## nightlight_5000           nightlight_5000  0.427599111
## road_class_1_3000       road_class_1_3000  0.413775990
## wind_speed_10m_5         wind_speed_10m_5  0.396320098
## wind_speed_10m_12       wind_speed_10m_12  0.393395174
## wind_speed_10m_4         wind_speed_10m_4  0.393183892
## temperature_2m_2         temperature_2m_2  0.382312236
## temperature_2m_9         temperature_2m_9  0.368789200
## nightlight_3000           nightlight_3000  0.367468078
## temperature_2m_6         temperature_2m_6  0.354149746
## temperature_2m_5         temperature_2m_5  0.353706359
## wind_speed_10m_11       wind_speed_10m_11  0.352174035
## temperature_2m_3         temperature_2m_3  0.317900095
## temperature_2m_4         temperature_2m_4  0.317442393
## road_class_2_300         road_class_2_300  0.308608854
## road_class_1_800         road_class_1_800  0.302495037
## wind_speed_10m_6         wind_speed_10m_6  0.300725291
## nightlight_1000           nightlight_1000  0.290013803
## temperature_2m_8         temperature_2m_8  0.276787041
## nightlight_800             nightlight_800  0.270524572
## wind_speed_10m_8         wind_speed_10m_8  0.250839238
## industry_3000               industry_3000  0.245463261
## wind_speed_10m_10       wind_speed_10m_10  0.214014242
## road_class_1_1000       road_class_1_1000  0.203118969
## road_class_2_1000       road_class_2_1000  0.192431560
## wind_speed_10m_9         wind_speed_10m_9  0.179351651
## temperature_2m_11       temperature_2m_11  0.173251525
## wind_speed_10m_7         wind_speed_10m_7  0.141301098
## road_class_2_800         road_class_2_800  0.132437655
## industry_1000               industry_1000  0.089815161
## road_class_2_500         road_class_2_500  0.085675114
## industry_800                 industry_800  0.051163423
## industry_300                 industry_300  0.029480932
## industry_500                 industry_500  0.024365236
## industry_50                   industry_50  0.012322343
## industry_100                 industry_100  0.004666668
## industry_25                   industry_25  0.001918723
plot(gbm1, i.var = 2:3)

#plot(gbm1, i.var = 1) 
#rf_residual <- pre_rf -  rdf_test$NO2
Xgboost
 y_varname= "value_mean"
  x = inde_var%>%dplyr::select(-value_mean)
  df_x = data.table(inde_var, keep.rownames = F)
  df_y =  inde_var$value_mean 
  formu = as.formula(paste(y_varname, "~.", sep = ""))
  dfmatrix_x = sparse.model.matrix(formu, data = df_x)[, -1]
  train_b = xgb.DMatrix(data = dfmatrix_x, label = df_y) 
  
  max_depth = 4
  eta = 0.01
  nthread = 4
  nrounds = 800
  Gamma = 2
  
  bst <- xgboost(data = train_b, max_depth = max_depth, eta = eta, nthread = nthread, nrounds = nrounds, Gamma = Gamma, verbose = 1, print_every_n = 200, early_stopping_rounds = 50, maximize = F )
## [1]  train-rmse:27.466974 
## Will train until train_rmse hasn't improved in 50 rounds.
## 
## [201]    train-rmse:8.140881 
## [401]    train-rmse:6.307603 
## [601]    train-rmse:5.849174 
## [800]    train-rmse:5.614880
Emsemble multiple models

caretEnsemble is computational intensive. Can customize using the “emsemble.R”

#install.packages("caretEnsemble")
#library(caretEnsemble)

# Stacking Algorithms - Run multiple algos in one call.
#trainControl <- trainControl(method = "repeatedcv", 
#                             number = 10, 
#                             repeats = 2,
#                             savePredictions = TRUE, 
#                             classProbs = TRUE)

#algorithmList <- c('rf', 'adaboost', 'earth', 'xgbDART', 'svmRadial')

#set.seed(100)
#models <- caretList(value_mean ~ ., data = inde_var , trControl = trainControl, methodList = algorithmList) 
#results <- resamples(models)
#summary(results)

Important variables and Partial dependence plots

Variable importance and partial dependence plots are commonly used to interpret models.

pre_mat = subset_grep(inde_var , grepstring = paste0("value_mean|",vastring))
rf = ranger(value_mean~ ., data = na.omit( pre_mat), mtry = 25, num.trees = 1000, importance = "permutation")
rf
## Ranger result
## 
## Call:
##  ranger(value_mean ~ ., data = na.omit(pre_mat), mtry = 25, num.trees = 1000,      importance = "permutation") 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      5479 
## Number of independent variables:  70 
## Mtry:                             25 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       49.89808 
## R squared (OOB):                  0.743837
# ranger method
sort(importance(rf), decreasing = T)
##       trop_mean_filt      population_5000      population_3000 
##         87.479677560         31.840708559         27.604229420 
##      population_1000     wind_speed_10m_3    road_class_2_5000 
##         19.029555017         12.476498167         12.186572080 
##     wind_speed_10m_2     wind_speed_10m_5     temperature_2m_5 
##         10.477857453          5.574277917          5.108817095 
##     wind_speed_10m_1     temperature_2m_7     wind_speed_10m_9 
##          4.826213110          4.726529811          4.626828489 
##     temperature_2m_2      road_class_2_50      road_class_2_25 
##          4.462547662          4.010641339          3.979302991 
##     wind_speed_10m_8    road_class_1_5000     wind_speed_10m_4 
##          3.966512728          3.904471469          3.890500453 
##     temperature_2m_1    road_class_1_3000    temperature_2m_10 
##          3.719995228          3.417720714          3.320787435 
##    road_class_2_3000    temperature_2m_12    wind_speed_10m_10 
##          3.318954016          3.297070843          3.163951384 
##      nightlight_5000     road_class_2_100     temperature_2m_9 
##          3.116675259          3.031329319          3.023502834 
##     temperature_2m_6     wind_speed_10m_6      nightlight_3000 
##          2.958142362          2.857796228          2.827216988 
##       nightlight_500     temperature_2m_4    wind_speed_10m_12 
##          2.788353501          2.744054364          2.719850351 
##       nightlight_800    temperature_2m_11     temperature_2m_3 
##          2.663034377          2.630197798          2.468464708 
##    wind_speed_10m_11     temperature_2m_8      nightlight_1000 
##          2.443370781          2.438949700          2.413829774 
##      road_class_1_50 road_class_M345_3000 road_class_M345_5000 
##          2.118428493          2.011790337          1.991108264 
##  road_class_M345_100  road_class_M345_300     wind_speed_10m_7 
##          1.955832054          1.939972655          1.807111172 
##      road_class_1_25     road_class_1_100            elevation 
##          1.611340545          1.557479081          1.526347419 
##  road_class_M345_500     road_class_1_300   road_class_M345_50 
##          1.489783411          1.442673354          1.371210077 
## road_class_M345_1000        industry_5000        industry_3000 
##          1.288327323          1.208903559          1.203999518 
##     road_class_2_300     road_class_1_500  road_class_M345_800 
##          1.166187828          1.165422896          1.132805986 
##   road_class_M345_25     road_class_1_800     road_class_2_500 
##          1.109183906          0.752390778          0.725377335 
##    road_class_2_1000    road_class_1_1000     road_class_2_800 
##          0.615055018          0.529721179          0.510319982 
##        industry_1000         industry_800         industry_300 
##          0.304580831          0.177394306          0.040891348 
##         industry_500         industry_100          industry_25 
##          0.032715080          0.014978036          0.002870220 
##          industry_50 
##          0.001575499

[Try after workshop as it takes a while: Variable importance and partial dependence plots can be calculated and presented using vi and sparklines packages, which include more matrices and presentation functionalities. ]

set.seed(2)
vip::list_metrics()

DF_P_r2 = vi(rf, method = "permute", target = "value_mean", metric = "r2" ) # very clear what method is used and what is the target
DF_P_rmse = vi(rf, method = "permute", target = "value_mean", metric = "rmse") 
vip (DF_P_r2)
vip (DF_P_rmse) 
a=add_sparklines(DF_P_r2, fit = rf)
saveWidget(a, file="sparkline.html")

Use a subset of variables in RF to inspect partial denpendence.

pre_mat_s = inde_var%>%na.omit%>%ungroup() %>% dplyr::select(value_mean, road_class_2_50, nightlight_500, population_5000, road_class_M345_1000) 
lm_s = lm(value_mean~., data = pre_mat_s)
rf_s = ranger(value_mean~ ., data = pre_mat_s, num.trees = 1000, importance = "permutation")
rf_s
## Ranger result
## 
## Call:
##  ranger(value_mean ~ ., data = pre_mat_s, num.trees = 1000, importance = "permutation") 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      5479 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       101.02 
## R squared (OOB):                  0.4813909

Partial dependence is the marginal effect one or two features have on the predicted outcome of a machine learning model. It is the integration of all the other predictors given each value of the variable under inspection. Partial dependence plot is calculated based on the assumption that the correlation between the variable to be inspected and variables to be integrated to be low. For example, if we are interested in the effect of population within 1000 m buffer, but we integrate over population within 3000 m buffer, then high population region as is shown with the 1000 m buffer data may include very low popolation within 3000 m buffer, which is very unlikely.

pre_mat_predictor = pre_mat_s%>%dplyr::select(-value_mean) 
ggpairs(pre_mat_predictor)

The partial dependence plots of linear and random forest regression

p_lm = partial(lm_s, "road_class_M345_1000",plot = TRUE, rug = TRUE)
plot(p_lm) # Linear regression

p2 = partial(rf_s, "road_class_M345_1000",plot = TRUE, rug = TRUE)
plot(p2) # random forest

We can also inspect 2D and 3-D PDP plots for more in-depth insights of the joint effects from variables. [Try afterwork shop as it will take a wile]

#slow
pd <- partial(rf_s, pred.var = c("population_3000", "road_class_M345_1000"  ))

# Default PDP
pd1 = plotPartial(pd)

# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 = plotPartial(pd, contour = TRUE, col.regions = rwb)
 
#3-D surface
 pdp3 <- plotPartial(pd, levelplot = F, zlab = "road_class_2_50", colorkey = T, 
                   screen = list(z = -20, x = -60) )

p3 = partial(rf_s, "road_class_2_50", plot = TRUE, rug = TRUE)
p1 = partial(rf_s, "population_3000", plot = TRUE, rug = TRUE)
grid.arrange(p1, p2, p3, pd1, pdp2, ncol = 2)